chore(spanner): optimize lock contention and skipped tablet reporting#12719
chore(spanner): optimize lock contention and skipped tablet reporting#12719
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces asynchronous cache updates in ChannelFinder and KeyAwareChannel to improve performance, and enhances EndpointLifecycleManager to track and report transient failure evictions. Additionally, it refactors KeyRangeCache to use ReentrantReadWriteLock for better concurrency. My review highlights critical issues: the addRanges method in KeyRangeCache is not exception-safe and risks reference count leaks, the readLock release in fillRoutingHint creates a race condition that requires CachedGroup to be fully synchronized, and the use of Executors.newCachedThreadPool in ChannelFinder poses a risk of resource exhaustion.
...panner/google-cloud-spanner/src/main/java/com/google/cloud/spanner/spi/v1/KeyRangeCache.java
Show resolved
Hide resolved
| readLock.lock(); | ||
| try { | ||
| targetRange = findRangeLocked(key, hintBuilder.getLimitKey(), rangeMode); | ||
| } finally { | ||
| readLock.unlock(); |
There was a problem hiding this comment.
Releasing the readLock before calling targetRange.group.fillRoutingHint() introduces a race condition because addRanges now performs group updates outside the global lock. Since CachedGroup relies on monitor-based synchronization (as seen in getActiveAddresses at line 180), both CachedGroup.update and CachedGroup.fillRoutingHint must be synchronized to ensure thread safety. Without this, concurrent updates and reads of the group's internal state (like tablets and leaderIndex) could lead to inconsistent behavior or data corruption.
There was a problem hiding this comment.
CachedGroup.update() is synchronized and CachedGroup.fillRoutingHint() now synchronizes on the same monitor
...panner/google-cloud-spanner/src/main/java/com/google/cloud/spanner/spi/v1/ChannelFinder.java
Outdated
Show resolved
Hide resolved
| for (Set<String> addresses : activeAddressesPerFinder.values()) { | ||
| allActive.addAll(addresses); | ||
| } | ||
| transientFailureEvictedAddresses.retainAll(allActive); |
There was a problem hiding this comment.
This call is synchronized on activeAddressLock. However, the probe method also adds to this set without holding a lock. This means that this call (and also the add call from probe) are not guaranteed to be atomic, and that the outcome is undefined. It won't cause any exceptions, as it is a ConcurrentHashSet, so if we are fine with this being slightly undeterministic, then this is fine. Otherwise, we should take the same lock in probe as well as here.
| } | ||
|
|
||
| @VisibleForTesting | ||
| void awaitPendingUpdates() throws InterruptedException { |
There was a problem hiding this comment.
Could we move this method to src/test/java? Or otherwise add a comment that it should only be called from test code? (The @VisibleForTesting annotation is intended to indicate that a method has higher visibility than otherwise needed to be able to test, it does not indicate that it should only be invoked from test code.)
Summary
synchronizedwithReadWriteLockinKeyRangeCacheto reduce lock contention on the hot path. Read operations (findServer,getActiveAddresses,size,debugString) now take a read lock, allowing concurrent lookups. Group updates are performed outside the write lock to minimize critical section duration.ChannelFinderviaupdateAsync(), using a sequential executor backed by a shared daemon thread pool. This preventsCacheUpdateprocessing from blocking the gRPC response listener thread.EndpointLifecycleManagertracks addresses evicted for TRANSIENT_FAILURE andKeyRangeCacheincludes their tablet UIDs inskipped_tablets, giving the server better routing visibility.CachedRange.lastAccessasvolatilefor safe cross-thread reads.